Modern high throughput sequencers can generate hundreds of millions of sequences in a single run. Before analysing this sequence to draw biological conclusions you should always perform some simple quality control checks to ensure that the raw data looks good and there are no problems or biases in your data which may affect how you can usefully use it.
Most sequencers will generate a QC report as part of their analysis pipeline, but this is usually only focused on identifying problems which were generated by the sequencer itself. FastQC aims to provide a QC report which can spot problems which originate either in the sequencer or in the starting library material.
FastQC can be run in one of two modes. It can either run as a stand alone interactive application for the immediate analysis of small numbers of FastQ files, or it can be run in a non-interactive mode where it would be suitable for integrating into a larger analysis pipeline for the systematic processing of large numbers of files.
Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.
Cutadapt helps with trimming tasks by finding the adapter or primer sequences in an error-tolerant way. It can also modify and filter single-end and paired-end reads in various ways. Adapter sequences can contain IUPAC wildcard characters. Cutadapt can also demultiplex your reads.
In this tutorial we are going to learn how to use FastQC and Cutadapt to quality control and trim adapter sequences from FASTQ files. The first thing to do is create a directory to store all the tutorial data. It is good practice to create a new directory for each project you work on, this ensures files do not get mixed up and all the results are self-contained.
Create a ‘tutorial’ directory to store output files:
mkdir tutorial
Download the tutorial and exercise data:
curl https://media.githubusercontent.com/media/zifornd/bioinformatics-workshop/main/data/quality/data.tar.gz --output tutorial/data.tar.gz
## % Total % Received % Xferd Average Speed Time Time Time Current
## Dload Upload Total Spent Left Speed
##
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:03 --:--:-- 0
0 61.4M 0 13851 0 0 3834 0 4:40:07 0:00:03 4:40:04 3838
0 61.4M 0 607k 0 0 133k 0 0:07:51 0:00:04 0:07:47 133k
1 61.4M 1 751k 0 0 135k 0 0:07:45 0:00:05 0:07:40 177k
1 61.4M 1 783k 0 0 118k 0 0:08:49 0:00:06 0:08:43 183k
1 61.4M 1 863k 0 0 114k 0 0:09:10 0:00:07 0:09:03 204k
1 61.4M 1 879k 0 0 102k 0 0:10:13 0:00:08 0:10:05 174k
1 61.4M 1 927k 0 0 99353 0 0:10:48 0:00:09 0:10:39 65392
1 61.4M 1 943k 0 0 91130 0 0:11:47 0:00:10 0:11:37 39009
1 61.4M 1 991k 0 0 87684 0 0:12:14 0:00:11 0:12:03 42647
1 61.4M 1 1055k 0 0 85399 0 0:12:34 0:00:12 0:12:22 38565
1 61.4M 1 1103k 0 0 83304 0 0:12:53 0:00:13 0:12:40 45943
2 61.4M 2 1280k 0 0 89725 0 0:11:58 0:00:14 0:11:44 71478
2 61.4M 2 1311k 0 0 86350 0 0:12:26 0:00:15 0:12:11 76098
2 61.4M 2 1359k 0 0 83693 0 0:12:49 0:00:16 0:12:33 74548
2 61.4M 2 1408k 0 0 81933 0 0:13:06 0:00:17 0:12:49 73027
2 61.4M 2 1487k 0 0 82079 0 0:13:05 0:00:18 0:12:47 78737
2 61.4M 2 1695k 0 0 88818 0 0:12:05 0:00:19 0:11:46 86123
2 61.4M 2 1839k 0 0 91385 0 0:11:45 0:00:20 0:11:25 104k
3 61.4M 3 1935k 0 0 91860 0 0:11:41 0:00:21 0:11:20 116k
3 61.4M 3 1999k 0 0 90503 0 0:11:52 0:00:22 0:11:30 117k
3 61.4M 3 2048k 0 0 88905 0 0:12:04 0:00:23 0:11:41 111k
3 61.4M 3 2143k 0 0 89399 0 0:12:00 0:00:24 0:11:36 91677
3 61.4M 3 2399k 0 0 95894 0 0:11:11 0:00:25 0:10:46 111k
3 61.4M 3 2463k 0 0 94919 0 0:11:18 0:00:26 0:10:52 105k
4 61.4M 4 2575k 0 0 95695 0 0:11:13 0:00:27 0:10:46 116k
4 61.4M 4 2688k 0 0 96391 0 0:11:08 0:00:28 0:10:40 128k
4 61.4M 4 2783k 0 0 96384 0 0:11:08 0:00:29 0:10:39 127k
4 61.4M 4 2816k 0 0 94339 0 0:11:23 0:00:30 0:10:53 86252
4 61.4M 4 2880k 0 0 93401 0 0:11:29 0:00:31 0:10:58 85319
4 61.4M 4 3008k 0 0 94598 0 0:11:21 0:00:32 0:10:49 88528
5 61.4M 5 3200k 0 0 97628 0 0:11:00 0:00:33 0:10:27 102k
5 61.4M 5 3328k 0 0 98404 0 0:10:54 0:00:34 0:10:20 107k
5 61.4M 5 3407k 0 0 98141 0 0:10:56 0:00:35 0:10:21 118k
5 61.4M 5 3487k 0 0 97676 0 0:10:59 0:00:36 0:10:23 121k
5 61.4M 5 3567k 0 0 97288 0 0:11:02 0:00:37 0:10:25 112k
5 61.4M 5 3631k 0 0 96456 0 0:11:08 0:00:38 0:10:30 88574
5 61.4M 5 3679k 0 0 95216 0 0:11:16 0:00:39 0:10:37 72887
5 61.4M 5 3712k 0 0 93734 0 0:11:27 0:00:40 0:10:47 62335
5 61.4M 5 3743k 0 0 92230 0 0:11:38 0:00:41 0:10:57 52418
6 61.4M 6 3776k 0 0 90862 0 0:11:49 0:00:42 0:11:07 42603
6 61.4M 6 3919k 0 0 92141 0 0:11:39 0:00:43 0:10:56 58915
6 61.4M 6 4079k 0 0 93775 0 0:11:27 0:00:44 0:10:43 82296
6 61.4M 6 4367k 0 0 98187 0 0:10:56 0:00:45 0:10:11 131k
7 61.4M 7 4815k 0 0 103k 0 0:10:08 0:00:46 0:09:22 213k
7 61.4M 7 4975k 0 0 104k 0 0:10:01 0:00:47 0:09:14 240k
8 61.4M 8 5167k 0 0 106k 0 0:09:51 0:00:48 0:09:03 249k
8 61.4M 8 5327k 0 0 107k 0 0:09:45 0:00:49 0:08:56 249k
8 61.4M 8 5504k 0 0 108k 0 0:09:38 0:00:50 0:08:48 226k
9 61.4M 9 5727k 0 0 111k 0 0:09:26 0:00:51 0:08:35 183k
9 61.4M 9 5935k 0 0 112k 0 0:09:17 0:00:52 0:08:25 191k
9 61.4M 9 6127k 0 0 114k 0 0:09:10 0:00:53 0:08:17 192k
10 61.4M 10 6319k 0 0 115k 0 0:09:03 0:00:54 0:08:09 198k
10 61.4M 10 6735k 0 0 121k 0 0:08:39 0:00:55 0:07:44 246k
11 61.4M 11 7055k 0 0 124k 0 0:08:24 0:00:56 0:07:28 265k
11 61.4M 11 7311k 0 0 127k 0 0:08:15 0:00:57 0:07:18 275k
12 61.4M 12 7552k 0 0 128k 0 0:08:07 0:00:58 0:07:09 284k
12 61.4M 12 7759k 0 0 130k 0 0:08:03 0:00:59 0:07:04 286k
12 61.4M 12 7872k 0 0 129k 0 0:08:04 0:01:00 0:07:04 225k
12 61.4M 12 7936k 0 0 128k 0 0:08:08 0:01:01 0:07:07 173k
12 61.4M 12 8047k 0 0 128k 0 0:08:09 0:01:02 0:07:07 146k
13 61.4M 13 8223k 0 0 129k 0 0:08:06 0:01:03 0:07:03 133k
13 61.4M 13 8448k 0 0 130k 0 0:08:00 0:01:04 0:06:56 138k
14 61.4M 14 8943k 0 0 136k 0 0:07:41 0:01:05 0:06:36 216k
15 61.4M 15 9503k 0 0 142k 0 0:07:20 0:01:06 0:06:14 317k
16 61.4M 16 9.8M 0 0 149k 0 0:06:59 0:01:07 0:05:52 416k
16 61.4M 16 10.1M 0 0 151k 0 0:06:56 0:01:08 0:05:48 430k
16 61.4M 16 10.3M 0 0 152k 0 0:06:52 0:01:09 0:05:43 431k
17 61.4M 17 10.5M 0 0 152k 0 0:06:52 0:01:10 0:05:42 364k
17 61.4M 17 10.6M 0 0 152k 0 0:06:53 0:01:11 0:05:42 278k
17 61.4M 17 10.7M 0 0 151k 0 0:06:54 0:01:12 0:05:42 178k
17 61.4M 17 10.9M 0 0 152k 0 0:06:53 0:01:13 0:05:40 166k
18 61.4M 18 11.1M 0 0 152k 0 0:06:51 0:01:14 0:05:37 158k
18 61.4M 18 11.3M 0 0 154k 0 0:06:47 0:01:15 0:05:32 179k
19 61.4M 19 11.7M 0 0 156k 0 0:06:40 0:01:16 0:05:24 223k
20 61.4M 20 12.4M 0 0 164k 0 0:06:22 0:01:17 0:05:05 349k
21 61.4M 21 12.9M 0 0 168k 0 0:06:13 0:01:18 0:04:55 408k
21 61.4M 21 13.2M 0 0 170k 0 0:06:09 0:01:19 0:04:50 430k
21 61.4M 21 13.3M 0 0 169k 0 0:06:11 0:01:20 0:04:51 393k
21 61.4M 21 13.3M 0 0 168k 0 0:06:14 0:01:21 0:04:53 339k
22 61.4M 22 13.5M 0 0 168k 0 0:06:14 0:01:22 0:04:52 220k
22 61.4M 22 13.7M 0 0 168k 0 0:06:12 0:01:23 0:04:49 176k
22 61.4M 22 14.0M 0 0 169k 0 0:06:10 0:01:24 0:04:46 162k
23 61.4M 23 14.2M 0 0 170k 0 0:06:08 0:01:25 0:04:43 195k
23 61.4M 23 14.7M 0 0 174k 0 0:06:01 0:01:26 0:04:35 272k
24 61.4M 24 15.2M 0 0 177k 0 0:05:53 0:01:27 0:04:26 342k
25 61.4M 25 15.7M 0 0 182k 0 0:05:44 0:01:28 0:04:16 408k
26 61.4M 26 16.1M 0 0 184k 0 0:05:41 0:01:29 0:04:12 428k
26 61.4M 26 16.4M 0 0 185k 0 0:05:38 0:01:30 0:04:08 444k
27 61.4M 27 16.6M 0 0 186k 0 0:05:37 0:01:31 0:04:06 399k
27 61.4M 27 16.8M 0 0 186k 0 0:05:37 0:01:32 0:04:05 335k
27 61.4M 27 17.1M 0 0 187k 0 0:05:35 0:01:33 0:04:02 281k
28 61.4M 28 17.3M 0 0 188k 0 0:05:34 0:01:34 0:04:00 262k
28 61.4M 28 17.7M 0 0 189k 0 0:05:31 0:01:35 0:03:56 259k
29 61.4M 29 18.2M 0 0 193k 0 0:05:25 0:01:36 0:03:49 317k
30 61.4M 30 18.7M 0 0 197k 0 0:05:18 0:01:37 0:03:41 397k
30 61.4M 30 18.9M 0 0 196k 0 0:05:19 0:01:38 0:03:41 367k
31 61.4M 31 19.0M 0 0 196k 0 0:05:20 0:01:39 0:03:41 342k
31 61.4M 31 19.0M 0 0 194k 0 0:05:23 0:01:40 0:03:43 284k
31 61.4M 31 19.1M 0 0 193k 0 0:05:25 0:01:41 0:03:44 191k
31 61.4M 31 19.1M 0 0 191k 0 0:05:28 0:01:42 0:03:46 81664
31 61.4M 31 19.2M 0 0 190k 0 0:05:30 0:01:43 0:03:47 58990
31 61.4M 31 19.4M 0 0 190k 0 0:05:30 0:01:44 0:03:46 84634
31 61.4M 31 19.5M 0 0 189k 0 0:05:31 0:01:45 0:03:46 99k
32 61.4M 32 19.8M 0 0 190k 0 0:05:30 0:01:46 0:03:44 134k
32 61.4M 32 19.9M 0 0 190k 0 0:05:31 0:01:47 0:03:44 159k
32 61.4M 32 20.0M 0 0 189k 0 0:05:32 0:01:48 0:03:44 165k
32 61.4M 32 20.0M 0 0 187k 0 0:05:35 0:01:49 0:03:46 128k
32 61.4M 32 20.2M 0 0 187k 0 0:05:36 0:01:50 0:03:46 128k
33 61.4M 33 20.4M 0 0 187k 0 0:05:35 0:01:51 0:03:44 124k
33 61.4M 33 20.6M 0 0 188k 0 0:05:34 0:01:52 0:03:42 144k
34 61.4M 34 20.9M 0 0 188k 0 0:05:33 0:01:53 0:03:40 176k
34 61.4M 34 21.2M 0 0 189k 0 0:05:31 0:01:54 0:03:37 229k
34 61.4M 34 21.3M 0 0 189k 0 0:05:32 0:01:55 0:03:37 233k
35 61.4M 35 21.5M 0 0 189k 0 0:05:31 0:01:56 0:03:35 236k
35 61.4M 35 21.6M 0 0 188k 0 0:05:33 0:01:57 0:03:36 200k
35 61.4M 35 21.7M 0 0 188k 0 0:05:34 0:01:58 0:03:36 179k
35 61.4M 35 21.9M 0 0 187k 0 0:05:35 0:01:59 0:03:36 141k
35 61.4M 35 22.0M 0 0 187k 0 0:05:35 0:02:00 0:03:35 150k
36 61.4M 36 22.2M 0 0 187k 0 0:05:35 0:02:01 0:03:34 147k
36 61.4M 36 22.4M 0 0 187k 0 0:05:35 0:02:02 0:03:33 161k
36 61.4M 36 22.6M 0 0 187k 0 0:05:35 0:02:03 0:03:32 172k
37 61.4M 37 22.8M 0 0 188k 0 0:05:34 0:02:04 0:03:30 202k
37 61.4M 37 23.2M 0 0 189k 0 0:05:32 0:02:05 0:03:27 236k
38 61.4M 38 23.6M 0 0 191k 0 0:05:29 0:02:06 0:03:23 268k
39 61.4M 39 24.1M 0 0 193k 0 0:05:24 0:02:07 0:03:17 345k
40 61.4M 40 24.6M 0 0 196k 0 0:05:20 0:02:08 0:03:12 419k
40 61.4M 40 24.8M 0 0 196k 0 0:05:20 0:02:09 0:03:11 391k
40 61.4M 40 24.9M 0 0 195k 0 0:05:22 0:02:10 0:03:12 342k
40 61.4M 40 25.0M 0 0 194k 0 0:05:23 0:02:11 0:03:12 284k
40 61.4M 40 25.1M 0 0 194k 0 0:05:24 0:02:12 0:03:12 207k
41 61.4M 41 25.3M 0 0 194k 0 0:05:23 0:02:13 0:03:10 137k
41 61.4M 41 25.6M 0 0 195k 0 0:05:22 0:02:14 0:03:08 161k
42 61.4M 42 26.0M 0 0 196k 0 0:05:19 0:02:15 0:03:04 232k
43 61.4M 43 26.5M 0 0 198k 0 0:05:16 0:02:16 0:03:00 307k
43 61.4M 43 27.0M 0 0 200k 0 0:05:13 0:02:17 0:02:56 381k
44 61.4M 44 27.5M 0 0 203k 0 0:05:08 0:02:18 0:02:50 457k
46 61.4M 46 28.2M 0 0 207k 0 0:05:03 0:02:19 0:02:44 538k
46 61.4M 46 28.6M 0 0 208k 0 0:05:01 0:02:20 0:02:41 527k
47 61.4M 47 28.9M 0 0 209k 0 0:05:00 0:02:21 0:02:39 505k
47 61.4M 47 29.2M 0 0 210k 0 0:04:59 0:02:22 0:02:37 464k
48 61.4M 48 29.8M 0 0 212k 0 0:04:55 0:02:23 0:02:32 466k
49 61.4M 49 30.2M 0 0 214k 0 0:04:53 0:02:24 0:02:29 403k
49 61.4M 49 30.5M 0 0 215k 0 0:04:52 0:02:25 0:02:27 400k
50 61.4M 50 30.8M 0 0 215k 0 0:04:51 0:02:26 0:02:25 389k
50 61.4M 50 31.2M 0 0 216k 0 0:04:50 0:02:27 0:02:23 402k
51 61.4M 51 31.6M 0 0 218k 0 0:04:48 0:02:28 0:02:20 371k
52 61.4M 52 32.2M 0 0 220k 0 0:04:44 0:02:29 0:02:15 415k
53 61.4M 53 32.8M 0 0 223k 0 0:04:41 0:02:30 0:02:11 456k
54 61.4M 54 33.2M 0 0 224k 0 0:04:40 0:02:31 0:02:09 487k
54 61.4M 54 33.7M 0 0 226k 0 0:04:38 0:02:32 0:02:06 506k
55 61.4M 55 33.9M 0 0 226k 0 0:04:37 0:02:33 0:02:04 469k
55 61.4M 55 34.0M 0 0 225k 0 0:04:38 0:02:34 0:02:04 367k
55 61.4M 55 34.2M 0 0 225k 0 0:04:39 0:02:35 0:02:04 289k
55 61.4M 55 34.3M 0 0 224k 0 0:04:40 0:02:36 0:02:04 225k
56 61.4M 56 34.5M 0 0 224k 0 0:04:40 0:02:37 0:02:03 165k
56 61.4M 56 34.6M 0 0 223k 0 0:04:41 0:02:38 0:02:03 137k
56 61.4M 56 34.7M 0 0 223k 0 0:04:42 0:02:39 0:02:03 138k
56 61.4M 56 34.8M 0 0 222k 0 0:04:42 0:02:40 0:02:02 134k
57 61.4M 57 35.0M 0 0 222k 0 0:04:42 0:02:41 0:02:01 151k
57 61.4M 57 35.2M 0 0 222k 0 0:04:43 0:02:42 0:02:01 154k
57 61.4M 57 35.4M 0 0 221k 0 0:04:43 0:02:43 0:02:00 160k
57 61.4M 57 35.5M 0 0 221k 0 0:04:44 0:02:44 0:02:00 169k
58 61.4M 58 35.8M 0 0 221k 0 0:04:44 0:02:45 0:01:59 186k
58 61.4M 58 36.0M 0 0 221k 0 0:04:43 0:02:46 0:01:57 201k
58 61.4M 58 36.2M 0 0 221k 0 0:04:44 0:02:47 0:01:57 192k
59 61.4M 59 36.2M 0 0 220k 0 0:04:45 0:02:48 0:01:57 172k
59 61.4M 59 36.3M 0 0 219k 0 0:04:46 0:02:49 0:01:57 154k
59 61.4M 59 36.3M 0 0 218k 0 0:04:48 0:02:50 0:01:58 118k
59 61.4M 59 36.4M 0 0 217k 0 0:04:49 0:02:51 0:01:58 75038
59 61.4M 59 36.4M 0 0 216k 0 0:04:50 0:02:52 0:01:58 59014
59 61.4M 59 36.5M 0 0 215k 0 0:04:51 0:02:53 0:01:58 58187
59 61.4M 59 36.5M 0 0 214k 0 0:04:53 0:02:54 0:01:59 55681
59 61.4M 59 36.6M 0 0 213k 0 0:04:54 0:02:55 0:01:59 59442
59 61.4M 59 36.7M 0 0 213k 0 0:04:55 0:02:56 0:01:59 68879
60 61.4M 60 36.9M 0 0 212k 0 0:04:55 0:02:57 0:01:58 88493
60 61.4M 60 37.0M 0 0 212k 0 0:04:56 0:02:58 0:01:58 96134
60 61.4M 60 37.0M 0 0 211k 0 0:04:57 0:02:59 0:01:58 94571
60 61.4M 60 37.1M 0 0 210k 0 0:04:58 0:03:00 0:01:58 102k
60 61.4M 60 37.3M 0 0 210k 0 0:04:58 0:03:01 0:01:57 128k
61 61.4M 61 37.5M 0 0 210k 0 0:04:58 0:03:02 0:01:56 135k
61 61.4M 61 37.7M 0 0 210k 0 0:04:58 0:03:03 0:01:55 159k
61 61.4M 61 37.9M 0 0 210k 0 0:04:58 0:03:04 0:01:54 183k
62 61.4M 62 38.1M 0 0 210k 0 0:04:59 0:03:05 0:01:54 195k
62 61.4M 62 38.4M 0 0 210k 0 0:04:58 0:03:06 0:01:52 201k
62 61.4M 62 38.5M 0 0 210k 0 0:04:58 0:03:07 0:01:51 202k
62 61.4M 62 38.6M 0 0 209k 0 0:04:59 0:03:08 0:01:51 178k
62 61.4M 62 38.6M 0 0 208k 0 0:05:01 0:03:09 0:01:52 145k
63 61.4M 63 38.7M 0 0 208k 0 0:05:02 0:03:10 0:01:52 121k
63 61.4M 63 38.7M 0 0 206k 0 0:05:04 0:03:12 0:01:52 71290
63 61.4M 63 38.7M 0 0 206k 0 0:05:05 0:03:12 0:01:53 44910
63 61.4M 63 38.7M 0 0 205k 0 0:05:06 0:03:13 0:01:53 32692
63 61.4M 63 38.8M 0 0 204k 0 0:05:07 0:03:14 0:01:53 46248
63 61.4M 63 38.9M 0 0 203k 0 0:05:08 0:03:15 0:01:53 42656
63 61.4M 63 38.9M 0 0 203k 0 0:05:09 0:03:16 0:01:53 50386
63 61.4M 63 39.0M 0 0 202k 0 0:05:11 0:03:17 0:01:54 54028
63 61.4M 63 39.0M 0 0 201k 0 0:05:12 0:03:18 0:01:54 59365
63 61.4M 63 39.1M 0 0 200k 0 0:05:14 0:03:19 0:01:55 46531
63 61.4M 63 39.1M 0 0 199k 0 0:05:15 0:03:20 0:01:55 38520
63 61.4M 63 39.1M 0 0 198k 0 0:05:16 0:03:21 0:01:55 29609
63 61.4M 63 39.2M 0 0 198k 0 0:05:17 0:03:22 0:01:55 39025
63 61.4M 63 39.2M 0 0 197k 0 0:05:18 0:03:23 0:01:55 39188
63 61.4M 63 39.2M 0 0 196k 0 0:05:20 0:03:24 0:01:56 40562
64 61.4M 64 39.3M 0 0 195k 0 0:05:21 0:03:25 0:01:56 50208
64 61.4M 64 39.4M 0 0 195k 0 0:05:22 0:03:26 0:01:56 61956
64 61.4M 64 39.5M 0 0 194k 0 0:05:22 0:03:27 0:01:55 67356
64 61.4M 64 39.5M 0 0 194k 0 0:05:24 0:03:28 0:01:56 64188
64 61.4M 64 39.6M 0 0 193k 0 0:05:25 0:03:29 0:01:56 66791
64 61.4M 64 39.6M 0 0 192k 0 0:05:26 0:03:30 0:01:56 68673
64 61.4M 64 39.7M 0 0 192k 0 0:05:27 0:03:31 0:01:56 62448
64 61.4M 64 39.7M 0 0 191k 0 0:05:28 0:03:32 0:01:56 47427
64 61.4M 64 39.7M 0 0 190k 0 0:05:29 0:03:33 0:01:56 46400
64 61.4M 64 39.8M 0 0 190k 0 0:05:31 0:03:34 0:01:57 46332
64 61.4M 64 39.8M 0 0 189k 0 0:05:32 0:03:35 0:01:57 42706
65 61.4M 65 40.0M 0 0 189k 0 0:05:32 0:03:36 0:01:56 65418
65 61.4M 65 40.1M 0 0 188k 0 0:05:33 0:03:37 0:01:56 67647
65 61.4M 65 40.1M 0 0 188k 0 0:05:34 0:03:38 0:01:56 79776
65 61.4M 65 40.3M 0 0 188k 0 0:05:34 0:03:39 0:01:55 102k
65 61.4M 65 40.4M 0 0 187k 0 0:05:35 0:03:40 0:01:55 111k
65 61.4M 65 40.5M 0 0 187k 0 0:05:36 0:03:41 0:01:55 99k
66 61.4M 66 40.6M 0 0 186k 0 0:05:36 0:03:42 0:01:54 111k
66 61.4M 66 40.7M 0 0 186k 0 0:05:37 0:03:43 0:01:54 111k
66 61.4M 66 40.7M 0 0 186k 0 0:05:38 0:03:44 0:01:54 98359
66 61.4M 66 40.9M 0 0 185k 0 0:05:38 0:03:45 0:01:53 112k
67 61.4M 67 41.2M 0 0 186k 0 0:05:37 0:03:46 0:01:51 143k
67 61.4M 67 41.5M 0 0 187k 0 0:05:36 0:03:47 0:01:49 201k
68 61.4M 68 41.9M 0 0 188k 0 0:05:34 0:03:48 0:01:46 258k
68 61.4M 68 42.2M 0 0 188k 0 0:05:34 0:03:49 0:01:45 287k
68 61.4M 68 42.3M 0 0 188k 0 0:05:34 0:03:50 0:01:44 287k
69 61.4M 69 42.5M 0 0 188k 0 0:05:34 0:03:51 0:01:43 274k
69 61.4M 69 42.7M 0 0 188k 0 0:05:34 0:03:52 0:01:42 233k
69 61.4M 69 42.9M 0 0 188k 0 0:05:34 0:03:53 0:01:41 193k
70 61.4M 70 43.1M 0 0 188k 0 0:05:34 0:03:54 0:01:40 184k
70 61.4M 70 43.2M 0 0 187k 0 0:05:34 0:03:55 0:01:39 177k
70 61.4M 70 43.4M 0 0 187k 0 0:05:34 0:03:56 0:01:38 175k
70 61.4M 70 43.6M 0 0 188k 0 0:05:34 0:03:57 0:01:37 182k
71 61.4M 71 43.9M 0 0 188k 0 0:05:33 0:03:58 0:01:35 204k
72 61.4M 72 44.2M 0 0 189k 0 0:05:32 0:03:59 0:01:33 241k
72 61.4M 72 44.5M 0 0 189k 0 0:05:32 0:04:00 0:01:32 260k
72 61.4M 72 44.6M 0 0 189k 0 0:05:32 0:04:01 0:01:31 250k
72 61.4M 72 44.7M 0 0 188k 0 0:05:33 0:04:02 0:01:31 230k
73 61.4M 73 44.9M 0 0 188k 0 0:05:33 0:04:03 0:01:30 201k
73 61.4M 73 45.0M 0 0 188k 0 0:05:33 0:04:04 0:01:29 153k
73 61.4M 73 45.1M 0 0 188k 0 0:05:34 0:04:05 0:01:29 133k
73 61.4M 73 45.4M 0 0 188k 0 0:05:33 0:04:06 0:01:27 155k
74 61.4M 74 45.5M 0 0 188k 0 0:05:34 0:04:07 0:01:27 159k
74 61.4M 74 45.6M 0 0 187k 0 0:05:34 0:04:08 0:01:26 146k
74 61.4M 74 45.7M 0 0 187k 0 0:05:35 0:04:09 0:01:26 143k
74 61.4M 74 45.9M 0 0 187k 0 0:05:35 0:04:10 0:01:25 153k
74 61.4M 74 46.0M 0 0 187k 0 0:05:35 0:04:11 0:01:24 131k
75 61.4M 75 46.1M 0 0 186k 0 0:05:36 0:04:12 0:01:24 118k
75 61.4M 75 46.1M 0 0 186k 0 0:05:37 0:04:13 0:01:24 106k
75 61.4M 75 46.2M 0 0 185k 0 0:05:38 0:04:14 0:01:24 98k
75 61.4M 75 46.2M 0 0 185k 0 0:05:39 0:04:15 0:01:24 72237
75 61.4M 75 46.2M 0 0 184k 0 0:05:40 0:04:16 0:01:24 51961
75 61.4M 75 46.3M 0 0 184k 0 0:05:41 0:04:17 0:01:24 55699
75 61.4M 75 46.5M 0 0 184k 0 0:05:41 0:04:18 0:01:23 89892
76 61.4M 76 46.8M 0 0 184k 0 0:05:40 0:04:19 0:01:21 138k
76 61.4M 76 47.1M 0 0 185k 0 0:05:39 0:04:20 0:01:19 172k
76 61.4M 76 47.2M 0 0 185k 0 0:05:40 0:04:21 0:01:19 201k
77 61.4M 77 47.3M 0 0 184k 0 0:05:40 0:04:22 0:01:18 203k
77 61.4M 77 47.4M 0 0 184k 0 0:05:41 0:04:23 0:01:18 185k
77 61.4M 77 47.5M 0 0 184k 0 0:05:41 0:04:24 0:01:17 140k
77 61.4M 77 47.6M 0 0 183k 0 0:05:42 0:04:25 0:01:17 101k
77 61.4M 77 47.6M 0 0 183k 0 0:05:43 0:04:26 0:01:17 85091
77 61.4M 77 47.7M 0 0 182k 0 0:05:44 0:04:27 0:01:17 78752
77 61.4M 77 47.7M 0 0 182k 0 0:05:45 0:04:28 0:01:17 65457
77 61.4M 77 47.8M 0 0 181k 0 0:05:46 0:04:29 0:01:17 52271
77 61.4M 77 47.9M 0 0 181k 0 0:05:46 0:04:30 0:01:16 65680
78 61.4M 78 48.0M 0 0 181k 0 0:05:47 0:04:31 0:01:16 75458
78 61.4M 78 48.3M 0 0 181k 0 0:05:46 0:04:32 0:01:14 134k
79 61.4M 79 48.7M 0 0 182k 0 0:05:44 0:04:33 0:01:11 204k
79 61.4M 79 48.9M 0 0 182k 0 0:05:44 0:04:34 0:01:10 227k
79 61.4M 79 49.1M 0 0 182k 0 0:05:44 0:04:35 0:01:09 246k
80 61.4M 80 49.4M 0 0 183k 0 0:05:43 0:04:36 0:01:07 287k
80 61.4M 80 49.6M 0 0 183k 0 0:05:43 0:04:37 0:01:06 262k
81 61.4M 81 49.9M 0 0 183k 0 0:05:42 0:04:38 0:01:04 230k
81 61.4M 81 50.3M 0 0 184k 0 0:05:41 0:04:39 0:01:02 284k
82 61.4M 82 50.5M 0 0 184k 0 0:05:41 0:04:40 0:01:01 289k
82 61.4M 82 50.5M 0 0 183k 0 0:05:42 0:04:41 0:01:01 235k
82 61.4M 82 50.6M 0 0 183k 0 0:05:42 0:04:42 0:01:00 201k
82 61.4M 82 50.7M 0 0 183k 0 0:05:43 0:04:43 0:01:00 162k
82 61.4M 82 50.7M 0 0 182k 0 0:05:44 0:04:44 0:01:00 91750
82 61.4M 82 50.8M 0 0 182k 0 0:05:45 0:04:45 0:01:00 59211
82 61.4M 82 50.9M 0 0 182k 0 0:05:45 0:04:46 0:00:59 75670
83 61.4M 83 51.0M 0 0 181k 0 0:05:46 0:04:47 0:00:59 84057
83 61.4M 83 51.1M 0 0 181k 0 0:05:46 0:04:48 0:00:58 85439
83 61.4M 83 51.1M 0 0 181k 0 0:05:47 0:04:49 0:00:58 88140
83 61.4M 83 51.2M 0 0 180k 0 0:05:48 0:04:50 0:00:58 98261
83 61.4M 83 51.3M 0 0 180k 0 0:05:48 0:04:51 0:00:57 91805
83 61.4M 83 51.5M 0 0 180k 0 0:05:48 0:04:52 0:00:56 103k
84 61.4M 84 51.6M 0 0 180k 0 0:05:49 0:04:53 0:00:56 111k
84 61.4M 84 51.7M 0 0 180k 0 0:05:49 0:04:54 0:00:55 121k
84 61.4M 84 51.9M 0 0 180k 0 0:05:49 0:04:55 0:00:54 140k
84 61.4M 84 52.2M 0 0 180k 0 0:05:49 0:04:56 0:00:53 169k
85 61.4M 85 52.4M 0 0 180k 0 0:05:48 0:04:57 0:00:51 189k
85 61.4M 85 52.6M 0 0 180k 0 0:05:48 0:04:58 0:00:50 205k
86 61.4M 86 52.8M 0 0 180k 0 0:05:48 0:04:59 0:00:49 224k
86 61.4M 86 53.1M 0 0 181k 0 0:05:47 0:05:00 0:00:47 239k
86 61.4M 86 53.2M 0 0 180k 0 0:05:48 0:05:01 0:00:47 208k
86 61.4M 86 53.3M 0 0 180k 0 0:05:48 0:05:02 0:00:46 172k
87 61.4M 87 53.5M 0 0 180k 0 0:05:48 0:05:03 0:00:45 176k
87 61.4M 87 53.8M 0 0 180k 0 0:05:47 0:05:04 0:00:43 192k
87 61.4M 87 54.0M 0 0 180k 0 0:05:47 0:05:05 0:00:42 171k
88 61.4M 88 54.1M 0 0 180k 0 0:05:47 0:05:06 0:00:41 187k
88 61.4M 88 54.2M 0 0 180k 0 0:05:48 0:05:07 0:00:41 195k
88 61.4M 88 54.4M 0 0 180k 0 0:05:48 0:05:08 0:00:40 191k
88 61.4M 88 54.6M 0 0 180k 0 0:05:47 0:05:09 0:00:38 175k
89 61.4M 89 55.0M 0 0 181k 0 0:05:46 0:05:10 0:00:36 211k
90 61.4M 90 55.3M 0 0 182k 0 0:05:45 0:05:11 0:00:34 250k
90 61.4M 90 55.6M 0 0 182k 0 0:05:45 0:05:12 0:00:33 285k
91 61.4M 91 55.9M 0 0 182k 0 0:05:44 0:05:13 0:00:31 301k
91 61.4M 91 56.1M 0 0 182k 0 0:05:44 0:05:14 0:00:30 297k
91 61.4M 91 56.3M 0 0 182k 0 0:05:44 0:05:15 0:00:29 271k
92 61.4M 92 56.6M 0 0 183k 0 0:05:43 0:05:16 0:00:27 259k
92 61.4M 92 56.9M 0 0 183k 0 0:05:42 0:05:17 0:00:25 262k
93 61.4M 93 57.2M 0 0 184k 0 0:05:41 0:05:18 0:00:23 275k
93 61.4M 93 57.5M 0 0 184k 0 0:05:41 0:05:19 0:00:22 288k
94 61.4M 94 57.8M 0 0 184k 0 0:05:40 0:05:20 0:00:20 303k
94 61.4M 94 58.3M 0 0 185k 0 0:05:38 0:05:21 0:00:17 342k
95 61.4M 95 58.7M 0 0 186k 0 0:05:37 0:05:22 0:00:15 371k
96 61.4M 96 59.3M 0 0 187k 0 0:05:35 0:05:23 0:00:12 416k
97 61.4M 97 59.8M 0 0 188k 0 0:05:33 0:05:24 0:00:09 473k
98 61.4M 98 60.4M 0 0 190k 0 0:05:30 0:05:25 0:00:05 541k
99 61.4M 99 61.2M 0 0 191k 0 0:05:27 0:05:26 0:00:01 591k
100 61.4M 100 61.4M 0 0 192k 0 0:05:27 0:05:27 --:--:-- 617k
Extract the archive file into the tutorial directory:
tar xf tutorial/data.tar.gz --directory=tutorial
The software we are going to use in this tutorial can be installed using the conda package manager. Please refer to the previous conda workshop for details on installing software and creating conda environments.
Create a new environment with FastQC installed:
conda create --yes --name fastqc fastqc
## Collecting package metadata (current_repodata.json): ...working... done
## Solving environment: ...working... done
##
## ## Package Plan ##
##
## environment location: /opt/miniconda3/envs/fastqc
##
## added / updated specs:
## - fastqc
##
##
## The following NEW packages will be INSTALLED:
##
## expat conda-forge/osx-64::expat-2.4.9-hf0c8a7f_0
## fastqc bioconda/noarch::fastqc-0.11.9-hdfd78af_1
## font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0
## fontconfig conda-forge/osx-64::fontconfig-2.14.0-h5bb23bf_1
## freetype conda-forge/osx-64::freetype-2.12.1-h3f81eb7_0
## libcxx conda-forge/osx-64::libcxx-14.0.6-hccf4f1f_0
## libpng conda-forge/osx-64::libpng-1.6.38-ha978bb4_0
## libzlib conda-forge/osx-64::libzlib-1.2.12-hfd90126_3
## openjdk conda-forge/osx-64::openjdk-17.0.3-hbc0c0cd_2
## perl conda-forge/osx-64::perl-5.32.1-2_h0d85af4_perl5
##
##
## Preparing transaction: ...working... done
## Verifying transaction: ...working... done
## Executing transaction: ...working... done
## #
## # To activate this environment, use
## #
## # $ conda activate fastqc
## #
## # To deactivate an active environment, use
## #
## # $ conda deactivate
##
## Retrieving notices: ...working... done
Activate the new environment to use it:
conda activate fastqc
Test that the fastqc command is available:
which fastqc
## /opt/miniconda3/envs/fastqc/bin/fastqc
Create a new environment with Cutadapt installed:
conda create --yes --name cutadapt cutadapt
## Collecting package metadata (current_repodata.json): ...working... done
## Solving environment: ...working... done
##
## ## Package Plan ##
##
## environment location: /opt/miniconda3/envs/cutadapt
##
## added / updated specs:
## - cutadapt
##
##
## The following NEW packages will be INSTALLED:
##
## bzip2 conda-forge/osx-64::bzip2-1.0.8-h0d85af4_4
## ca-certificates conda-forge/osx-64::ca-certificates-2022.9.14-h033912b_0
## cutadapt bioconda/osx-64::cutadapt-4.1-py310hd528523_1
## dnaio bioconda/osx-64::dnaio-0.9.1-py310hd528523_1
## isa-l conda-forge/osx-64::isa-l-2.30.0-h0d85af4_4
## libffi conda-forge/osx-64::libffi-3.4.2-h0d85af4_5
## libsqlite conda-forge/osx-64::libsqlite-3.39.3-ha978bb4_0
## libzlib conda-forge/osx-64::libzlib-1.2.12-hfd90126_3
## ncurses conda-forge/osx-64::ncurses-6.3-h96cf925_1
## openssl conda-forge/osx-64::openssl-3.0.5-hfd90126_2
## pbzip2 conda-forge/osx-64::pbzip2-1.1.13-h9d27c22_1
## pigz conda-forge/osx-64::pigz-2.6-h5dbffcc_0
## pip conda-forge/noarch::pip-22.2.2-pyhd8ed1ab_0
## python conda-forge/osx-64::python-3.10.6-hc14f532_0_cpython
## python-isal conda-forge/osx-64::python-isal-1.0.1-py310h3c08dca_0
## python_abi conda-forge/osx-64::python_abi-3.10-2_cp310
## readline conda-forge/osx-64::readline-8.1.2-h3899abd_0
## setuptools conda-forge/noarch::setuptools-65.3.0-pyhd8ed1ab_1
## tk conda-forge/osx-64::tk-8.6.12-h5dbffcc_0
## tzdata conda-forge/noarch::tzdata-2022c-h191b570_0
## wheel conda-forge/noarch::wheel-0.37.1-pyhd8ed1ab_0
## xopen conda-forge/osx-64::xopen-1.6.0-py310h2ec42d9_0
## xz conda-forge/osx-64::xz-5.2.6-h775f41a_0
## zlib conda-forge/osx-64::zlib-1.2.12-hfd90126_3
##
##
## Preparing transaction: ...working... done
## Verifying transaction: ...working... done
## Executing transaction: ...working... done
## #
## # To activate this environment, use
## #
## # $ conda activate cutadapt
## #
## # To deactivate an active environment, use
## #
## # $ conda deactivate
##
## Retrieving notices: ...working... done
Activate the new environment to use it:
conda activate cutadapt
Test that the cutadapt command is available:
which cutadapt
## /opt/miniconda3/envs/cutadapt/bin/cutadapt
Quality control is an important step in the processing of any type of data. It tells the researcher whether the data is suitable for analysis and whether the conclusions drawn can be trusted. Data integrity however is just one part of the quality control procedure and many other issues, outside of the data, may remain.
In the example data directory are two FASTQ files. The ‘good’ file has no quality problems and passes all of the FastQC modules. The ‘bad’ file has lots of problems and would not be suitable for data analysis. Lets run FastQC on both files and compare the results.
Activate the fastqc environment:
conda activate fastqc
Print the help information for the fastqc command:
fastqc --help
##
## FastQC - A high throughput sequence QC analysis tool
##
## SYNOPSIS
##
## fastqc seqfile1 seqfile2 .. seqfileN
##
## fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam]
## [-c contaminant file] seqfile1 .. seqfileN
##
## DESCRIPTION
##
## FastQC reads a set of sequence files and produces from each one a quality
## control report consisting of a number of different modules, each one of
## which will help to identify a different potential type of problem in your
## data.
##
## If no files to process are specified on the command line then the program
## will start as an interactive graphical application. If files are provided
## on the command line then the program will run with no user interaction
## required. In this mode it is suitable for inclusion into a standardised
## analysis pipeline.
##
## The options for the program as as follows:
##
## -h --help Print this help file and exit
##
## -v --version Print the version of the program and exit
##
## -o --outdir Create all output files in the specified output directory.
## Please note that this directory must exist as the program
## will not create it. If this option is not set then the
## output file for each sequence file is created in the same
## directory as the sequence file which was processed.
##
## --casava Files come from raw casava output. Files in the same sample
## group (differing only by the group number) will be analysed
## as a set rather than individually. Sequences with the filter
## flag set in the header will be excluded from the analysis.
## Files must have the same names given to them by casava
## (including being gzipped and ending with .gz) otherwise they
## won't be grouped together correctly.
##
## --nano Files come from nanopore sequences and are in fast5 format. In
## this mode you can pass in directories to process and the program
## will take in all fast5 files within those directories and produce
## a single output file from the sequences found in all files.
##
## --nofilter If running with --casava then don't remove read flagged by
## casava as poor quality when performing the QC analysis.
##
## --extract If set then the zipped output file will be uncompressed in
## the same directory after it has been created. By default
## this option will be set if fastqc is run in non-interactive
## mode.
##
## -j --java Provides the full path to the java binary you want to use to
## launch fastqc. If not supplied then java is assumed to be in
## your path.
##
## --noextract Do not uncompress the output file after creating it. You
## should set this option if you do not wish to uncompress
## the output when running in non-interactive mode.
##
## --nogroup Disable grouping of bases for reads >50bp. All reports will
## show data for every base in the read. WARNING: Using this
## option will cause fastqc to crash and burn if you use it on
## really long reads, and your plots may end up a ridiculous size.
## You have been warned!
##
## --min_length Sets an artificial lower limit on the length of the sequence
## to be shown in the report. As long as you set this to a value
## greater or equal to your longest read length then this will be
## the sequence length used to create your read groups. This can
## be useful for making directly comaparable statistics from
## datasets with somewhat variable read lengths.
##
## -f --format Bypasses the normal sequence file format detection and
## forces the program to use the specified format. Valid
## formats are bam,sam,bam_mapped,sam_mapped and fastq
##
## -t --threads Specifies the number of files which can be processed
## simultaneously. Each thread will be allocated 250MB of
## memory so you shouldn't run more threads than your
## available memory will cope with, and not more than
## 6 threads on a 32 bit machine
##
## -c Specifies a non-default file which contains the list of
## --contaminants contaminants to screen overrepresented sequences against.
## The file must contain sets of named contaminants in the
## form name[tab]sequence. Lines prefixed with a hash will
## be ignored.
##
## -a Specifies a non-default file which contains the list of
## --adapters adapter sequences which will be explicity searched against
## the library. The file must contain sets of named adapters
## in the form name[tab]sequence. Lines prefixed with a hash
## will be ignored.
##
## -l Specifies a non-default file which contains a set of criteria
## --limits which will be used to determine the warn/error limits for the
## various modules. This file can also be used to selectively
## remove some modules from the output all together. The format
## needs to mirror the default limits.txt file found in the
## Configuration folder.
##
## -k --kmers Specifies the length of Kmer to look for in the Kmer content
## module. Specified Kmer length must be between 2 and 10. Default
## length is 7 if not specified.
##
## -q --quiet Supress all progress messages on stdout and only report errors.
##
## -d --dir Selects a directory to be used for temporary files written when
## generating report images. Defaults to system temp directory if
## not specified.
##
## BUGS
##
## Any bugs in fastqc should be reported either to simon.andrews@babraham.ac.uk
## or in www.bioinformatics.babraham.ac.uk/bugzilla/
##
##
Create a fastqc output directory:
mkdir tutorial/fastqc
Run FastQC to generate a quality control report for the ‘good’ data:
fastqc -o tutorial/fastqc tutorial/data/good.fastq.gz
## Started analysis of good.fastq.gz
## Approx 5% complete for good.fastq.gz
## Approx 10% complete for good.fastq.gz
## Approx 15% complete for good.fastq.gz
## Approx 20% complete for good.fastq.gz
## Approx 25% complete for good.fastq.gz
## Approx 30% complete for good.fastq.gz
## Approx 35% complete for good.fastq.gz
## Approx 40% complete for good.fastq.gz
## Approx 45% complete for good.fastq.gz
## Approx 50% complete for good.fastq.gz
## Approx 55% complete for good.fastq.gz
## Approx 60% complete for good.fastq.gz
## Approx 65% complete for good.fastq.gz
## Approx 70% complete for good.fastq.gz
## Approx 75% complete for good.fastq.gz
## Approx 80% complete for good.fastq.gz
## Approx 85% complete for good.fastq.gz
## Approx 90% complete for good.fastq.gz
## Approx 95% complete for good.fastq.gz
## Approx 100% complete for good.fastq.gz
## Analysis complete for good.fastq.gz
Inspect the fastqc output directory:
ls tutorial/fastqc
## good_fastqc.html
## good_fastqc.zip
You should see that FastQC generates a .html report and a .zip folder. The HTML report contains the results from a number of tests which FastQC runs on the sequencing data. For an explanation of the tests, please refer to the FastQC documentation for each one:
Inspect the quality control report for the ‘good’ data:
The report uses a traffic light system (Green, Yellow, Red) to grade the result from each test (Pass, Warn, Fail). For this particular example, the data passed all of the tests. Once your data passes the majority of these tests it is safe to proceed with any downstream analysis or processing.
Importantly, while these tests are useful, do not rely on them to absolve you of any responsibility. Data which passes all of these tests may still exhibit biases which are not detected, and vice versa. It is important to think about what the tests are measuring and whether those measurements make sense for your particular data.
Lets now extract the .zip folder and see what it contains:
unzip -o tutorial/fastqc/good_fastqc.zip -d tutorial/fastqc
## Archive: tutorial/fastqc/good_fastqc.zip
## creating: tutorial/fastqc/good_fastqc/
## creating: tutorial/fastqc/good_fastqc/Icons/
## creating: tutorial/fastqc/good_fastqc/Images/
## inflating: tutorial/fastqc/good_fastqc/Icons/fastqc_icon.png
## inflating: tutorial/fastqc/good_fastqc/Icons/warning.png
## inflating: tutorial/fastqc/good_fastqc/Icons/error.png
## inflating: tutorial/fastqc/good_fastqc/Icons/tick.png
## inflating: tutorial/fastqc/good_fastqc/summary.txt
## inflating: tutorial/fastqc/good_fastqc/Images/per_base_quality.png
## inflating: tutorial/fastqc/good_fastqc/Images/per_tile_quality.png
## inflating: tutorial/fastqc/good_fastqc/Images/per_sequence_quality.png
## inflating: tutorial/fastqc/good_fastqc/Images/per_base_sequence_content.png
## inflating: tutorial/fastqc/good_fastqc/Images/per_sequence_gc_content.png
## inflating: tutorial/fastqc/good_fastqc/Images/per_base_n_content.png
## inflating: tutorial/fastqc/good_fastqc/Images/sequence_length_distribution.png
## inflating: tutorial/fastqc/good_fastqc/Images/duplication_levels.png
## inflating: tutorial/fastqc/good_fastqc/Images/adapter_content.png
## inflating: tutorial/fastqc/good_fastqc/fastqc_report.html
## inflating: tutorial/fastqc/good_fastqc/fastqc_data.txt
## inflating: tutorial/fastqc/good_fastqc/fastqc.fo
List all files in the extracted folder:
ls tutorial/fastqc/good_fastqc
## Icons
## Images
## fastqc.fo
## fastqc_data.txt
## fastqc_report.html
## summary.txt
The folder contains images and supplementary data. The images are those present in the .html report so we don’t need to look at these again. The others we will look at are the fastqc_data.txt and summary.txt files.
First, lets display the contents of the fastqc_data.txt file:
cat tutorial/fastqc/good_fastqc/fastqc_data.txt
## ##FastQC 0.11.9
## >>Basic Statistics pass
## #Measure Value
## Filename good.fastq.gz
## File type Conventional base calls
## Encoding Illumina 1.5
## Total Sequences 250000
## Sequences flagged as poor quality 0
## Sequence length 40
## %GC 45
## >>END_MODULE
## >>Per base sequence quality pass
## #Base Mean Median Lower Quartile Upper Quartile 10th Percentile 90th Percentile
## 1 37.018596 38.0 38.0 38.0 35.0 38.0
## 2 36.964208 38.0 37.0 38.0 35.0 38.0
## 3 37.128464 38.0 37.0 38.0 35.0 38.0
## 4 37.111792 38.0 37.0 38.0 35.0 38.0
## 5 37.154904 38.0 38.0 38.0 36.0 38.0
## 6 37.205776 38.0 38.0 38.0 36.0 38.0
## 7 37.206184 38.0 38.0 38.0 36.0 38.0
## 8 37.169392 38.0 38.0 38.0 36.0 38.0
## 9 37.15452 38.0 38.0 38.0 35.0 38.0
## 10 37.126808 38.0 37.0 38.0 35.0 38.0
## 11 36.82764 38.0 37.0 38.0 35.0 38.0
## 12 36.893108 38.0 37.0 38.0 35.0 38.0
## 13 36.979468 38.0 37.0 38.0 35.0 38.0
## 14 36.996272 38.0 37.0 38.0 35.0 38.0
## 15 36.903852 38.0 37.0 38.0 35.0 38.0
## 16 37.063328 38.0 37.0 38.0 35.0 38.0
## 17 37.044728 38.0 37.0 38.0 35.0 38.0
## 18 37.005588 38.0 37.0 38.0 35.0 38.0
## 19 37.023184 38.0 37.0 38.0 35.0 38.0
## 20 37.025436 38.0 37.0 38.0 35.0 38.0
## 21 36.98352 38.0 37.0 38.0 35.0 38.0
## 22 36.962568 38.0 37.0 38.0 35.0 38.0
## 23 36.94942 38.0 37.0 38.0 35.0 38.0
## 24 36.937728 38.0 37.0 38.0 35.0 38.0
## 25 36.913956 38.0 37.0 38.0 35.0 38.0
## 26 36.838288 38.0 37.0 38.0 35.0 38.0
## 27 36.810176 38.0 37.0 38.0 35.0 38.0
## 28 36.779396 38.0 37.0 38.0 35.0 38.0
## 29 36.764576 38.0 37.0 38.0 35.0 38.0
## 30 36.739352 38.0 37.0 38.0 35.0 38.0
## 31 36.680796 38.0 37.0 38.0 35.0 38.0
## 32 36.6185 38.0 37.0 38.0 35.0 38.0
## 33 36.577148 38.0 37.0 38.0 34.0 38.0
## 34 36.517688 38.0 36.0 38.0 34.0 38.0
## 35 36.448552 38.0 36.0 38.0 34.0 38.0
## 36 36.356024 38.0 36.0 38.0 34.0 38.0
## 37 36.196704 38.0 36.0 38.0 33.0 38.0
## 38 36.226052 38.0 36.0 38.0 33.0 38.0
## 39 36.107872 38.0 36.0 38.0 33.0 38.0
## 40 35.623204 37.0 36.0 38.0 32.0 38.0
## >>END_MODULE
## >>Per tile sequence quality pass
## #Tile Base Mean
## 1 1 0.0
## 1 2 0.0
## 1 3 0.0
## 1 4 0.0
## 1 5 0.0
## 1 6 0.0
## 1 7 0.0
## 1 8 0.0
## 1 9 0.0
## 1 10 0.0
## 1 11 0.0
## 1 12 0.0
## 1 13 0.0
## 1 14 0.0
## 1 15 0.0
## 1 16 0.0
## 1 17 0.0
## 1 18 0.0
## 1 19 0.0
## 1 20 0.0
## 1 21 0.0
## 1 22 0.0
## 1 23 0.0
## 1 24 0.0
## 1 25 0.0
## 1 26 0.0
## 1 27 0.0
## 1 28 0.0
## 1 29 0.0
## 1 30 0.0
## 1 31 0.0
## 1 32 0.0
## 1 33 0.0
## 1 34 0.0
## 1 35 0.0
## 1 36 0.0
## 1 37 0.0
## 1 38 0.0
## 1 39 0.0
## 1 40 0.0
## >>END_MODULE
## >>Per sequence quality scores pass
## #Quality Count
## 2 58.0
## 3 1.0
## 4 0.0
## 5 2.0
## 6 3.0
## 7 3.0
## 8 4.0
## 9 5.0
## 10 8.0
## 11 4.0
## 12 14.0
## 13 20.0
## 14 22.0
## 15 28.0
## 16 42.0
## 17 35.0
## 18 65.0
## 19 114.0
## 20 101.0
## 21 128.0
## 22 178.0
## 23 213.0
## 24 292.0
## 25 380.0
## 26 517.0
## 27 796.0
## 28 1142.0
## 29 1485.0
## 30 2171.0
## 31 2900.0
## 32 4039.0
## 33 5175.0
## 34 7369.0
## 35 11430.0
## 36 25412.0
## 37 179886.0
## 38 5958.0
## >>END_MODULE
## >>Per base sequence content pass
## #Base G A T C
## 1 20.016690271096543 26.817975742937726 28.279979297644502 24.88535468832122
## 2 21.0868 26.802 28.2756 23.8356
## 3 21.0276 25.516 28.7004 24.756
## 4 21.5552 25.175199999999997 27.971200000000003 25.298399999999997
## 5 19.8816 25.629600000000003 29.1024 25.3864
## 6 19.877269690654014 25.550546245884654 28.133963252913247 26.438220810548085
## 7 19.584 25.311600000000002 29.321199999999997 25.7832
## 8 19.2504 25.3576 29.3064 26.0856
## 9 19.0136 25.2044 29.3124 26.4696
## 10 19.1928 25.523200000000003 28.650399999999998 26.6336
## 11 19.1492 25.7216 28.875600000000002 26.2536
## 12 19.3004 25.258399999999998 28.9476 26.4936
## 13 19.3076 25.269599999999997 29.1224 26.300400000000003
## 14 18.9756 25.4816 29.212 26.3308
## 15 18.92 25.4032 29.1968 26.479999999999997
## 16 19.0192 25.352400000000003 29.053600000000003 26.5748
## 17 18.9208 25.338 29.2972 26.444000000000003
## 18 18.9768 25.274 29.014 26.7352
## 19 18.9368 25.3268 28.871599999999997 26.8648
## 20 18.9652 25.3764 28.9264 26.732
## 21 18.9752 25.272 28.6188 27.134000000000004
## 22 19.0024 25.354 28.8404 26.8032
## 23 19.0588 25.137999999999998 28.8488 26.9544
## 24 18.9376 25.0708 28.8868 27.1048
## 25 19.016 25.163200000000003 28.9584 26.862399999999997
## 26 18.9668 25.0468 28.9968 26.989600000000003
## 27 18.9696 25.0456 28.967599999999997 27.017200000000003
## 28 18.788552775799857 25.150610843980765 28.99288748789913 27.06794889232025
## 29 18.859101745627928 25.167602681642908 28.884862157794526 27.088433414934638
## 30 18.996399999999998 25.243199999999998 28.708 27.0524
## 31 18.903200000000002 25.0704 28.792 27.234399999999997
## 32 18.9176 25.2032 28.7708 27.1084
## 33 19.056 24.9164 28.7568 27.2708
## 34 18.944 25.190800000000003 28.721200000000003 27.144000000000002
## 35 18.7776 24.898 29.1236 27.2008
## 36 18.7084 24.7668 29.164 27.3608
## 37 18.706 25.014799999999997 28.8904 27.388800000000003
## 38 18.6152 25.1504 28.898000000000003 27.3364
## 39 18.692 24.8808 28.861199999999997 27.566000000000003
## 40 18.691899070385126 25.036000576009215 28.817661082577324 27.454439271028335
## >>END_MODULE
## >>Per sequence GC content pass
## #GC Content Count
## 0 9.0
## 1 5.5
## 2 2.0
## 3 2.0
## 4 8.0
## 5 14.0
## 6 26.0
## 7 38.0
## 8 38.0
## 9 53.5
## 10 69.0
## 11 125.0
## 12 181.0
## 13 181.0
## 14 255.0
## 15 329.0
## 16 460.0
## 17 591.0
## 18 591.0
## 19 801.5
## 20 1012.0
## 21 1333.0
## 22 1654.0
## 23 1654.0
## 24 2291.5
## 25 2929.0
## 26 3727.5
## 27 4526.0
## 28 4526.0
## 29 6006.5
## 30 7487.0
## 31 9478.5
## 32 11470.0
## 33 11470.0
## 34 13628.0
## 35 15786.0
## 36 16785.5
## 37 17785.0
## 38 17785.0
## 39 18616.0
## 40 19447.0
## 41 20321.5
## 42 21196.0
## 43 21196.0
## 44 21714.5
## 45 22233.0
## 46 22642.0
## 47 23051.0
## 48 23051.0
## 49 23179.5
## 50 23308.0
## 51 22107.5
## 52 20907.0
## 53 20907.0
## 54 19226.5
## 55 17546.0
## 56 15523.5
## 57 13501.0
## 58 13501.0
## 59 11574.5
## 60 9648.0
## 61 8011.0
## 62 6374.0
## 63 6374.0
## 64 5222.5
## 65 4071.0
## 66 3167.0
## 67 2263.0
## 68 2263.0
## 69 1707.5
## 70 1152.0
## 71 877.0
## 72 602.0
## 73 602.0
## 74 485.5
## 75 369.0
## 76 277.0
## 77 185.0
## 78 185.0
## 79 151.0
## 80 117.0
## 81 88.0
## 82 59.0
## 83 59.0
## 84 52.5
## 85 46.0
## 86 32.5
## 87 19.0
## 88 19.0
## 89 15.5
## 90 12.0
## 91 9.0
## 92 6.0
## 93 6.0
## 94 4.5
## 95 3.0
## 96 2.0
## 97 1.0
## 98 1.0
## 99 1.5
## 100 2.0
## >>END_MODULE
## >>Per base N content pass
## #Base N-Count
## 1 0.30119999999999997
## 2 0.0
## 3 0.0
## 4 0.0
## 5 0.0
## 6 0.0084
## 7 0.0
## 8 0.0
## 9 0.0
## 10 0.0
## 11 0.0
## 12 0.0
## 13 0.0
## 14 0.0
## 15 0.0
## 16 0.0
## 17 0.0
## 18 0.0
## 19 0.0
## 20 0.0
## 21 0.0
## 22 0.0
## 23 0.0
## 24 0.0
## 25 0.0
## 26 0.0
## 27 0.0
## 28 0.0072
## 29 0.0015999999999999999
## 30 0.0
## 31 0.0
## 32 0.0
## 33 0.0
## 34 0.0
## 35 0.0
## 36 0.0
## 37 0.0
## 38 0.0
## 39 0.0
## 40 0.0015999999999999999
## >>END_MODULE
## >>Sequence Length Distribution pass
## #Length Count
## 40 250000.0
## >>END_MODULE
## >>Sequence Duplication Levels pass
## #Total Deduplicated Percentage 94.81007801740206
## #Duplication Level Percentage of deduplicated Percentage of total
## 1 98.82446040062165 93.695548006106
## 2 0.8184774289647235 1.5519981779125618
## 3 0.13770282308459525 0.39166846199600974
## 4 0.05196687276280436 0.1970793304384755
## 5 0.02849674855564644 0.13508894769015645
## 6 0.013177264859977599 0.07496025056582681
## 7 0.008639504778333843 0.05733784854458847
## 8 0.006845752143545901 0.05192370358539074
## 9 0.002127511777105116 0.018153860181324994
## >10 0.09248166995806788 2.7312695332740327
## >50 0.013090397224314817 0.7826947074312642
## >100 0.002533625269222222 0.31227717227436624
## >500 0.0 0.0
## >1k 0.0 0.0
## >5k 0.0 0.0
## >10k+ 0.0 0.0
## >>END_MODULE
## >>Overrepresented sequences pass
## >>END_MODULE
## >>Adapter Content pass
## #Position Illumina Universal Adapter Illumina Small RNA 3' Adapter Illumina Small RNA 5' Adapter Nextera Transposase Sequence SOLID Small RNA Adapter
## 1 0.0 0.0 0.0 0.0 0.0
## 2 0.0 0.0 0.0 0.0 0.0
## 3 0.0 0.0 0.0 0.0 0.0
## 4 0.0 0.0 0.0 0.0 0.0
## 5 0.0 0.0 0.0 0.0 0.0
## 6 0.0 0.0 0.0 0.0 4.0E-4
## 7 0.0 0.0 0.0 0.0 4.0E-4
## 8 0.0 0.0 0.0 0.0 4.0E-4
## 9 0.0 0.0 0.0 0.0 4.0E-4
## 10 0.0 0.0 0.0 0.0 4.0E-4
## 11 0.0 0.0 0.0 0.0 4.0E-4
## 12 0.0 0.0 0.0 0.0 4.0E-4
## 13 0.0 0.0 0.0 0.0 4.0E-4
## 14 0.0 0.0 0.0 0.0 4.0E-4
## 15 0.0 0.0 0.0 0.0 4.0E-4
## 16 0.0 0.0 0.0 0.0 4.0E-4
## 17 0.0 0.0 0.0 0.0 4.0E-4
## 18 0.0 0.0 0.0 0.0 4.0E-4
## 19 0.0 0.0 0.0 0.0 4.0E-4
## 20 0.0036 0.0 0.0 0.0 4.0E-4
## 21 0.0044 0.0 0.0 0.0 4.0E-4
## 22 0.0044 0.0 0.0 0.0 4.0E-4
## 23 0.0052 0.0 0.0 0.0 4.0E-4
## 24 0.0052 0.0 0.0 0.0 4.0E-4
## 25 0.0052 0.0 0.0 0.0 4.0E-4
## 26 0.0052 0.0 0.0 0.0 4.0E-4
## 27 0.0052 0.0 0.0 0.0 4.0E-4
## 28 0.0052 0.0 0.0 4.0E-4 4.0E-4
## 29 0.0052 0.0 0.0 4.0E-4 4.0E-4
## >>END_MODULE
As you may have guessed, this file contains all of the data FastQC calculates when running each of the test modules. The layout of the file is not really amenable to processing on the command line so we won’t inspect it any further.
Next, display the contents of the summary.txt file:
cat tutorial/fastqc/good_fastqc/summary.txt
## PASS Basic Statistics good.fastq.gz
## PASS Per base sequence quality good.fastq.gz
## PASS Per tile sequence quality good.fastq.gz
## PASS Per sequence quality scores good.fastq.gz
## PASS Per base sequence content good.fastq.gz
## PASS Per sequence GC content good.fastq.gz
## PASS Per base N content good.fastq.gz
## PASS Sequence Length Distribution good.fastq.gz
## PASS Sequence Duplication Levels good.fastq.gz
## PASS Overrepresented sequences good.fastq.gz
## PASS Adapter Content good.fastq.gz
Obviously, this file contains a summary of the results for each test module. The first column lists the result, the second is the test module and the third is the input file. For comparison, lets now run FastQC on the ‘bad’ file included in the example data directory.
Run FastQC to generate a quality control report for the ‘bad’ data:
fastqc -o tutorial/fastqc tutorial/data/bad.fastq.gz
## Started analysis of bad.fastq.gz
## Approx 5% complete for bad.fastq.gz
## Approx 10% complete for bad.fastq.gz
## Approx 15% complete for bad.fastq.gz
## Approx 20% complete for bad.fastq.gz
## Approx 25% complete for bad.fastq.gz
## Approx 30% complete for bad.fastq.gz
## Approx 35% complete for bad.fastq.gz
## Approx 40% complete for bad.fastq.gz
## Approx 45% complete for bad.fastq.gz
## Approx 50% complete for bad.fastq.gz
## Approx 55% complete for bad.fastq.gz
## Approx 60% complete for bad.fastq.gz
## Approx 65% complete for bad.fastq.gz
## Approx 70% complete for bad.fastq.gz
## Approx 75% complete for bad.fastq.gz
## Approx 80% complete for bad.fastq.gz
## Approx 85% complete for bad.fastq.gz
## Approx 90% complete for bad.fastq.gz
## Approx 95% complete for bad.fastq.gz
## Approx 100% complete for bad.fastq.gz
## Analysis complete for bad.fastq.gz
Inspect the quality control report:
As expected, this data has a lot of problems:
So what can we do about this? It would be impractical to re-do the sequencing again and again until we end up with ‘good’ quality data. Instead, we can process or ‘clean’ the data using read trimming software.
Read trimming is used to ‘clean’ reads which exhibit adapter contamination or low-quality bases. There is an endless number of read trimming programs available, but all of them basically do the same thing. How fast and how many options they have is usually what differentiates them. We are going to use Cutadapt mainly because of the extensive documentation. Do not underestimate how important and rare well organised and detailed documentation is in the world of bioinformatics software.
In the example data directory is a ‘contaminated’ FASTQ file. We are going to use this file to demonstrate the effect read trimming has on the quality of the data, as measured by the FastQC report.
Run FastQC to generate a quality control report before trimming:
fastqc -o tutorial/fastqc tutorial/data/contam.fastq.gz
## Started analysis of contam.fastq.gz
## Approx 5% complete for contam.fastq.gz
## Approx 10% complete for contam.fastq.gz
## Approx 15% complete for contam.fastq.gz
## Approx 20% complete for contam.fastq.gz
## Approx 25% complete for contam.fastq.gz
## Approx 30% complete for contam.fastq.gz
## Approx 35% complete for contam.fastq.gz
## Approx 40% complete for contam.fastq.gz
## Approx 45% complete for contam.fastq.gz
## Approx 50% complete for contam.fastq.gz
## Approx 55% complete for contam.fastq.gz
## Approx 60% complete for contam.fastq.gz
## Approx 65% complete for contam.fastq.gz
## Approx 70% complete for contam.fastq.gz
## Approx 75% complete for contam.fastq.gz
## Approx 80% complete for contam.fastq.gz
## Approx 85% complete for contam.fastq.gz
## Approx 90% complete for contam.fastq.gz
## Approx 95% complete for contam.fastq.gz
## Approx 100% complete for contam.fastq.gz
## Analysis complete for contam.fastq.gz
Inspect the quality control report:
The un-trimmed data has the following issues:
We can use Cutadapt to address these issues by performing read quality and adapter trimming. Cutadapt has a number of options and it would be worth you reading through these at some point to get a sense of what problems can and can’t be addressed.
Activate the cutadapt environment:
conda activate cutadapt
Print the help information for the cutadapt command:
cutadapt --help
## cutadapt version 4.1
##
## Copyright (C) 2010-2022 Marcel Martin <marcel.martin@scilifelab.se>
##
## cutadapt removes adapter sequences from high-throughput sequencing reads.
##
## Usage:
## cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq
##
## For paired-end reads:
## cutadapt -a ADAPT1 -A ADAPT2 [options] -o out1.fastq -p out2.fastq in1.fastq in2.fastq
##
## Replace "ADAPTER" with the actual sequence of your 3' adapter. IUPAC wildcard
## characters are supported. All reads from input.fastq will be written to
## output.fastq with the adapter sequence removed. Adapter matching is
## error-tolerant. Multiple adapter sequences can be given (use further -a
## options), but only the best-matching adapter will be removed.
##
## Input may also be in FASTA format. Compressed input and output is supported and
## auto-detected from the file name (.gz, .xz, .bz2). Use the file name '-' for
## standard input/output. Without the -o option, output is sent to standard output.
##
## Citation:
##
## Marcel Martin. Cutadapt removes adapter sequences from high-throughput
## sequencing reads. EMBnet.Journal, 17(1):10-12, May 2011.
## http://dx.doi.org/10.14806/ej.17.1.200
##
## Run "cutadapt --help" to see all command-line options.
## See https://cutadapt.readthedocs.io/ for full documentation.
##
## Options:
## -h, --help Show this help message and exit
## --version Show version number and exit
## --debug Print debug log. Use twice to also print DP matrices
## -j CORES, --cores CORES
## Number of CPU cores to use. Use 0 to auto-detect.
## Default: 1
##
## Finding adapters:
## Parameters -a, -g, -b specify adapters to be removed from each read (or from
## R1 if data is paired-end. If specified multiple times, only the best
## matching adapter is trimmed (but see the --times option). Use notation
## 'file:FILE' to read adapter sequences from a FASTA file.
##
## -a ADAPTER, --adapter ADAPTER
## Sequence of an adapter ligated to the 3' end (paired
## data: of the first read). The adapter and subsequent
## bases are trimmed. If a '$' character is appended
## ('anchoring'), the adapter is only found if it is a
## suffix of the read.
## -g ADAPTER, --front ADAPTER
## Sequence of an adapter ligated to the 5' end (paired
## data: of the first read). The adapter and any preceding
## bases are trimmed. Partial matches at the 5' end are
## allowed. If a '^' character is prepended ('anchoring'),
## the adapter is only found if it is a prefix of the read.
## -b ADAPTER, --anywhere ADAPTER
## Sequence of an adapter that may be ligated to the 5' or
## 3' end (paired data: of the first read). Both types of
## matches as described under -a and -g are allowed. If the
## first base of the read is part of the match, the
## behavior is as with -g, otherwise as with -a. This
## option is mostly for rescuing failed library
## preparations - do not use if you know which end your
## adapter was ligated to!
## -e E, --error-rate E, --errors E
## Maximum allowed error rate (if 0 <= E < 1), or absolute
## number of errors for full-length adapter match (if E is
## an integer >= 1). Error rate = no. of errors divided by
## length of matching region. Default: 0.1 (10%)
## --no-indels Allow only mismatches in alignments. Default: allow both
## mismatches and indels
## -n COUNT, --times COUNT
## Remove up to COUNT adapters from each read. Default: 1
## -O MINLENGTH, --overlap MINLENGTH
## Require MINLENGTH overlap between read and adapter for
## an adapter to be found. Default: 3
## --match-read-wildcards
## Interpret IUPAC wildcards in reads. Default: False
## -N, --no-match-adapter-wildcards
## Do not interpret IUPAC wildcards in adapters.
## --action {trim,retain,mask,lowercase,none}
## What to do if a match was found. trim: trim adapter and
## up- or downstream sequence; retain: trim, but retain
## adapter; mask: replace with 'N' characters; lowercase:
## convert to lowercase; none: leave unchanged. Default:
## trim
## --rc, --revcomp Check both the read and its reverse complement for
## adapter matches. If match is on reverse-complemented
## version, output that one. Default: check only read
##
## Additional read modifications:
## -u LENGTH, --cut LENGTH
## Remove bases from each read (first read only if paired).
## If LENGTH is positive, remove bases from the beginning.
## If LENGTH is negative, remove bases from the end. Can be
## used twice if LENGTHs have different signs. This is
## applied *before* adapter trimming.
## --nextseq-trim 3'CUTOFF
## NextSeq-specific quality trimming (each read). Trims
## also dark cycles appearing as high-quality G bases.
## -q [5'CUTOFF,]3'CUTOFF, --quality-cutoff [5'CUTOFF,]3'CUTOFF
## Trim low-quality bases from 5' and/or 3' ends of each
## read before adapter removal. Applied to both reads if
## data is paired. If one value is given, only the 3' end
## is trimmed. If two comma-separated cutoffs are given,
## the 5' end is trimmed with the first cutoff, the 3' end
## with the second.
## --quality-base N Assume that quality values in FASTQ are encoded as
## ascii(quality + N). This needs to be set to 64 for some
## old Illumina FASTQ files. Default: 33
## --length LENGTH, -l LENGTH
## Shorten reads to LENGTH. Positive values remove bases at
## the end while negative ones remove bases at the
## beginning. This and the following modifications are
## applied after adapter trimming.
## --trim-n Trim N's on ends of reads.
## --length-tag TAG Search for TAG followed by a decimal number in the
## description field of the read. Replace the decimal
## number with the correct length of the trimmed read. For
## example, use --length-tag 'length=' to correct fields
## like 'length=123'.
## --strip-suffix STRIP_SUFFIX
## Remove this suffix from read names if present. Can be
## given multiple times.
## -x PREFIX, --prefix PREFIX
## Add this prefix to read names. Use {name} to insert the
## name of the matching adapter.
## -y SUFFIX, --suffix SUFFIX
## Add this suffix to read names; can also include {name}
## --rename TEMPLATE Rename reads using TEMPLATE containing variables such as
## {id}, {adapter_name} etc. (see documentation)
## --zero-cap, -z Change negative quality values to zero.
##
## Filtering of processed reads:
## Filters are applied after above read modifications. Paired-end reads are
## always discarded pairwise (see also --pair-filter).
##
## -m LEN[:LEN2], --minimum-length LEN[:LEN2]
## Discard reads shorter than LEN. Default: 0
## -M LEN[:LEN2], --maximum-length LEN[:LEN2]
## Discard reads longer than LEN. Default: no limit
## --max-n COUNT Discard reads with more than COUNT 'N' bases. If COUNT
## is a number between 0 and 1, it is interpreted as a
## fraction of the read length.
## --max-expected-errors ERRORS, --max-ee ERRORS
## Discard reads whose expected number of errors (computed
## from quality values) exceeds ERRORS.
## --discard-trimmed, --discard
## Discard reads that contain an adapter. Use also -O to
## avoid discarding too many randomly matching reads.
## --discard-untrimmed, --trimmed-only
## Discard reads that do not contain an adapter.
## --discard-casava Discard reads that did not pass CASAVA filtering (header
## has :Y:).
##
## Output:
## --quiet Print only error messages.
## --report {full,minimal}
## Which type of report to print: 'full' or 'minimal'.
## Default: full
## --json FILE Dump report in JSON format to FILE
## -o FILE, --output FILE
## Write trimmed reads to FILE. FASTQ or FASTA format is
## chosen depending on input. Summary report is sent to
## standard output. Use '{name}' for demultiplexing (see
## docs). Default: write to standard output
## --fasta Output FASTA to standard output even on FASTQ input.
## -Z Use compression level 1 for gzipped output files
## (faster, but uses more space)
## --info-file FILE Write information about each read and its adapter
## matches into FILE. See the documentation for the file
## format.
## -r FILE, --rest-file FILE
## When the adapter matches in the middle of a read, write
## the rest (after the adapter) to FILE.
## --wildcard-file FILE When the adapter has N wildcard bases, write adapter
## bases matching wildcard positions to FILE. (Inaccurate
## with indels.)
## --too-short-output FILE
## Write reads that are too short (according to length
## specified by -m) to FILE. Default: discard reads
## --too-long-output FILE
## Write reads that are too long (according to length
## specified by -M) to FILE. Default: discard reads
## --untrimmed-output FILE
## Write reads that do not contain any adapter to FILE.
## Default: output to same file as trimmed reads
##
## Paired-end options:
## The -A/-G/-B/-U/-Q options work like their lowercase counterparts, but are
## applied to R2 (second read in pair)
##
## -A ADAPTER 3' adapter to be removed from R2
## -G ADAPTER 5' adapter to be removed from R2
## -B ADAPTER 5'/3 adapter to be removed from R2
## -U LENGTH Remove LENGTH bases from R2
## -Q [5'CUTOFF,]3'CUTOFF
## Quality-trimming cutoff for R2. Default: same as for R1
## -p FILE, --paired-output FILE
## Write R2 to FILE.
## --pair-adapters Treat adapters given with -a/-A etc. as pairs. Either
## both or none are removed from each read pair.
## --pair-filter {any,both,first}
## Which of the reads in a paired-end read have to match
## the filtering criterion in order for the pair to be
## filtered. Default: any
## --interleaved Read and/or write interleaved paired-end reads.
## --untrimmed-paired-output FILE
## Write second read in a pair to this FILE when no adapter
## was found. Use with --untrimmed-output. Default: output
## to same file as trimmed reads
## --too-short-paired-output FILE
## Write second read in a pair to this file if pair is too
## short.
## --too-long-paired-output FILE
## Write second read in a pair to this file if pair is too
## long.
Create a directory for the cutadapt output files:
mkdir tutorial/cutadapt
Before we perform read trimming, we need to get the sequence of the contaminating adapter. You can either ask the person who prepared the libraries for the adapter sequence they used or simply search online for the corresponding sequence. Helpfully, Illumina maintains an adapter sequences document which lists all of the adapters used in their library preparation kits.
Lets now run Cutadapt to trim the adapter sequence and low quality bases:
# -q 28 | Trim low-quality bases from 3' end of each read
# -a AGATCGGAAGAGC | Illumina adapter sequence ligated to the 3' end
# -m 26 | Discard reads shorter than 26 bases
cutadapt -q 28 -a AGATCGGAAGAGC -m 36 -o tutorial/cutadapt/trimmed.fastq.gz tutorial/data/contam.fastq.gz
## This is cutadapt 4.1 with Python 3.10.6
## Command line parameters: -q 28 -a AGATCGGAAGAGC -m 36 -o tutorial/cutadapt/trimmed.fastq.gz tutorial/data/contam.fastq.gz
## Processing single-end reads on 1 core ...
## Finished in 3.61 s (14 µs/read; 4.16 M reads/minute).
##
## === Summary ===
##
## Total reads processed: 250,000
## Reads with adapters: 23,895 (9.6%)
##
## == Read fate breakdown ==
## Reads that were too short: 11,389 (4.6%)
## Reads written (passing filters): 238,611 (95.4%)
##
## Total basepairs processed: 62,750,000 bp
## Quality-trimmed: 24,964,552 bp (39.8%)
## Total written (filtered): 36,658,201 bp (58.4%)
##
## === Adapter 1 ===
##
## Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 23895 times
##
## Minimum overlap: 3
## No. of allowed errors:
## 1-9 bp: 0; 10-13 bp: 1
##
## Bases preceding removed adapters:
## A: 26.5%
## C: 27.0%
## G: 22.8%
## T: 23.6%
## none/other: 0.1%
##
## Overview of removed sequences
## length count expect max.err error counts
## 3 2209 3906.2 0 2209
## 4 397 976.6 0 397
## 5 397 244.1 0 397
## 6 295 61.0 0 295
## 7 276 15.3 0 276
## 8 114 3.8 0 114
## 9 78 1.0 0 70 8
## 10 66 0.2 1 57 9
## 11 37 0.1 1 37
## 12 41 0.0 1 36 5
## 13 53 0.0 1 52 1
## 14 77 0.0 1 68 9
## 15 110 0.0 1 102 8
## 16 137 0.0 1 116 21
## 17 714 0.0 1 696 18
## 18 829 0.0 1 815 14
## 19 65 0.0 1 63 2
## 20 96 0.0 1 94 2
## 21 56 0.0 1 54 2
## 22 108 0.0 1 104 4
## 23 48 0.0 1 45 3
## 24 42 0.0 1 37 5
## 25 176 0.0 1 167 9
## 26 75 0.0 1 74 1
## 27 90 0.0 1 81 9
## 28 405 0.0 1 397 8
## 29 285 0.0 1 264 21
## 30 548 0.0 1 540 8
## 31 51 0.0 1 41 10
## 32 103 0.0 1 95 8
## 33 115 0.0 1 103 12
## 34 820 0.0 1 790 30
## 35 1033 0.0 1 1020 13
## 36 133 0.0 1 129 4
## 37 57 0.0 1 57
## 38 28 0.0 1 20 8
## 39 142 0.0 1 139 3
## 40 297 0.0 1 291 6
## 41 69 0.0 1 58 11
## 42 594 0.0 1 570 24
## 43 1474 0.0 1 1457 17
## 44 200 0.0 1 189 11
## 45 216 0.0 1 211 5
## 46 74 0.0 1 65 9
## 47 222 0.0 1 208 14
## 48 664 0.0 1 651 13
## 49 134 0.0 1 128 6
## 50 69 0.0 1 65 4
## 51 45 0.0 1 43 2
## 52 42 0.0 1 41 1
## 53 89 0.0 1 81 8
## 54 525 0.0 1 500 25
## 55 761 0.0 1 756 5
## 56 38 0.0 1 30 8
## 57 118 0.0 1 117 1
## 58 64 0.0 1 62 2
## 59 17 0.0 1 16 1
## 60 123 0.0 1 123
## 61 149 0.0 1 149
## 62 74 0.0 1 66 8
## 63 535 0.0 1 527 8
## 64 341 0.0 1 331 10
## 65 524 0.0 1 497 27
## 66 1652 0.0 1 1627 25
## 67 426 0.0 1 422 4
## 68 256 0.0 1 253 3
## 69 361 0.0 1 349 12
## 70 493 0.0 1 476 17
## 71 740 0.0 1 711 29
## 72 816 0.0 1 796 20
## 73 626 0.0 1 617 9
## 74 426 0.0 1 420 6
## 75 209 0.0 1 205 4
## 76 72 0.0 1 70 2
## 77 13 0.0 1 13
## 78 4 0.0 1 3 1
## 80 1 0.0 1 1
## 82 1 0.0 1 1
## 85 1 0.0 1 0 1
## 86 1 0.0 1 0 1
## 87 2 0.0 1 2
## 89 1 0.0 1 1
## 90 2 0.0 1 2
## 92 2 0.0 1 2
## 93 2 0.0 1 2
## 102 1 0.0 1 1
## 103 1 0.0 1 0 1
## 104 2 0.0 1 2
## 105 4 0.0 1 4
## 106 1 0.0 1 1
## 111 1 0.0 1 1
## 120 1 0.0 1 1
## 123 1 0.0 1 0 1
## 125 1 0.0 1 0 1
## 126 1 0.0 1 0 1
## 136 1 0.0 1 1
## 138 1 0.0 1 0 1
## 139 1 0.0 1 0 1
## 148 1 0.0 1 0 1
## 162 1 0.0 1 0 1
## 164 1 0.0 1 0 1
## 169 1 0.0 1 0 1
## 179 1 0.0 1 0 1
## 188 1 0.0 1 0 1
## 198 1 0.0 1 0 1
Activate the fastqc environment:
conda activate fastqc
Run FastQC to generate a quality control report after trimming:
fastqc -o tutorial/fastqc tutorial/cutadapt/trimmed.fastq.gz
## Started analysis of trimmed.fastq.gz
## Approx 5% complete for trimmed.fastq.gz
## Approx 10% complete for trimmed.fastq.gz
## Approx 15% complete for trimmed.fastq.gz
## Approx 20% complete for trimmed.fastq.gz
## Approx 25% complete for trimmed.fastq.gz
## Approx 30% complete for trimmed.fastq.gz
## Approx 35% complete for trimmed.fastq.gz
## Approx 40% complete for trimmed.fastq.gz
## Approx 45% complete for trimmed.fastq.gz
## Approx 50% complete for trimmed.fastq.gz
## Approx 55% complete for trimmed.fastq.gz
## Approx 60% complete for trimmed.fastq.gz
## Approx 65% complete for trimmed.fastq.gz
## Approx 70% complete for trimmed.fastq.gz
## Approx 75% complete for trimmed.fastq.gz
## Approx 80% complete for trimmed.fastq.gz
## Approx 85% complete for trimmed.fastq.gz
## Approx 90% complete for trimmed.fastq.gz
## Approx 95% complete for trimmed.fastq.gz
## Analysis complete for trimmed.fastq.gz
Inspect the trimmed quality control report:
You should see that FastQC no longer reports any issues with the per base sequence quality or adapter contamination. However, we still receive a warning for the per sequence GC content. Is this a problem? Remember what I said previously about blindly following the FastQC test results. Does it matter that the GC content is different to this ‘theoretical’ distribution? What if the data we generated was from an assay that sampled just high GC regions on the genome? An understanding of the data generating process is crucial to assess the FastQC test results in their proper context. Having said that, it appears we have successfully ‘cleaned’ this data and can proceed with our analysis.
The exercises below are designed to strengthen your knowledge of using FastQC and Cutadapt to perform quality control on sequencing data. The solution to each problem is blurred, only after attempting to solve the problem yourself should you look at the solution. Should you need any help, please ask one of the instructors.
Create a directory to store the output files from each exercise:
mkdir exercises
mkdir exercises/chipseq
mkdir exercises/rnaseq
mkdir exercises/atacseq
ChIP sequencing, also known as ChIP-seq, is a method used to analyze protein interactions with DNA. ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins. It can be used to map global binding sites precisely for any protein of interest.
In the data directory is a file called chipseq.fastq.gz which contains sequencing data from a ChIP-seq experiment. Use FastQC and Cutadapt to quality check and clean the data if necessary. Pay close attention to the ‘Overrepresented sequences’ module. Can you clean the data so it passes this module?
# Run FastQC on the raw data
conda activate fastqc
fastqc -o exercises/chipseq tutorial/data/chipseq.fastq.gz
# Trim the Illumina TruSeq adapter sequence
# Remove over represented sequences containing any N base calls
conda activate cutadapt
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA --max-n 1 -o exercises/chipseq/chipseq_trimmed.fastq.gz tutorial/data/chipseq.fastq.gz
# Run FastQC on the trimmed data
conda activate fastqc
fastqc -o exercises/chipseq exercises/chipseq/chipseq_trimmed.fastq.gz
## Started analysis of chipseq.fastq.gz
## Approx 5% complete for chipseq.fastq.gz
## Approx 10% complete for chipseq.fastq.gz
## Approx 15% complete for chipseq.fastq.gz
## Approx 20% complete for chipseq.fastq.gz
## Approx 25% complete for chipseq.fastq.gz
## Approx 30% complete for chipseq.fastq.gz
## Approx 35% complete for chipseq.fastq.gz
## Approx 40% complete for chipseq.fastq.gz
## Approx 45% complete for chipseq.fastq.gz
## Approx 50% complete for chipseq.fastq.gz
## Approx 55% complete for chipseq.fastq.gz
## Approx 60% complete for chipseq.fastq.gz
## Approx 65% complete for chipseq.fastq.gz
## Approx 70% complete for chipseq.fastq.gz
## Approx 75% complete for chipseq.fastq.gz
## Approx 80% complete for chipseq.fastq.gz
## Approx 85% complete for chipseq.fastq.gz
## Approx 90% complete for chipseq.fastq.gz
## Approx 95% complete for chipseq.fastq.gz
## Approx 100% complete for chipseq.fastq.gz
## Analysis complete for chipseq.fastq.gz
## This is cutadapt 4.1 with Python 3.10.6
## Command line parameters: -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA --max-n 1 -o exercises/chipseq/chipseq_trimmed.fastq.gz tutorial/data/chipseq.fastq.gz
## Processing single-end reads on 1 core ...
## Finished in 0.69 s (7 µs/read; 8.69 M reads/minute).
##
## === Summary ===
##
## Total reads processed: 100,000
## Reads with adapters: 1,670 (1.7%)
##
## == Read fate breakdown ==
## Reads with too many N: 640 (0.6%)
## Reads written (passing filters): 99,360 (99.4%)
##
## Total basepairs processed: 3,751,129 bp
## Total written (filtered): 3,707,971 bp (98.8%)
##
## === Adapter 1 ===
##
## Sequence: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA; Type: regular 3'; Length: 33; Trimmed: 1670 times
##
## Minimum overlap: 3
## No. of allowed errors:
## 1-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-33 bp: 3
##
## Bases preceding removed adapters:
## A: 25.4%
## C: 15.5%
## G: 20.4%
## T: 12.3%
## none/other: 26.3%
##
## Overview of removed sequences
## length count expect max.err error counts
## 3 1041 1562.5 0 1041
## 4 150 390.6 0 150
## 5 32 97.7 0 32
## 6 2 24.4 0 2
## 9 1 0.4 0 0 1
## 10 1 0.1 1 0 1
## 37 5 0.0 3 0 1 4
## 38 438 0.0 3 0 416 20 2
## Started analysis of chipseq_trimmed.fastq.gz
## Approx 5% complete for chipseq_trimmed.fastq.gz
## Approx 10% complete for chipseq_trimmed.fastq.gz
## Approx 15% complete for chipseq_trimmed.fastq.gz
## Approx 20% complete for chipseq_trimmed.fastq.gz
## Approx 25% complete for chipseq_trimmed.fastq.gz
## Approx 30% complete for chipseq_trimmed.fastq.gz
## Approx 35% complete for chipseq_trimmed.fastq.gz
## Approx 40% complete for chipseq_trimmed.fastq.gz
## Approx 45% complete for chipseq_trimmed.fastq.gz
## Approx 50% complete for chipseq_trimmed.fastq.gz
## Approx 55% complete for chipseq_trimmed.fastq.gz
## Approx 60% complete for chipseq_trimmed.fastq.gz
## Approx 65% complete for chipseq_trimmed.fastq.gz
## Approx 70% complete for chipseq_trimmed.fastq.gz
## Approx 75% complete for chipseq_trimmed.fastq.gz
## Approx 80% complete for chipseq_trimmed.fastq.gz
## Approx 85% complete for chipseq_trimmed.fastq.gz
## Approx 90% complete for chipseq_trimmed.fastq.gz
## Approx 95% complete for chipseq_trimmed.fastq.gz
## Analysis complete for chipseq_trimmed.fastq.gz
RNA-Seq (named as an abbreviation of RNA sequencing) is a sequencing technique which uses next-generation sequencing (NGS) to reveal the presence and quantity of RNA in a biological sample at a given moment, analyzing the continuously changing cellular transcriptome.
In the data directory is a file called `rnaseq.fastq.gz`` which contains sequencing data from an RNA-seq experiment. Use FastQC and Cutadapt to quality check and clean the data if necessary. Pay close attention to the ‘Per sequence GC content’ and ‘Overrepresented sequences’ modules. Can you clean the data so both of these modules pass? Should you?
# This is an RNA-seq library.
# Depending on the cell type and condition, the GC content of expressed genes might be very skewed.
# The assumptions of the theoretical distribution do not hold for this particular type of data.
# Additionally, it is not clear whether the over-represented sequences are from contamination.
# In an RNA-seq library we expect to see thousands of copies of the same DNA fragment, because a particular gene might be highly expressed.
# The over-represented sequences you observe actually come from a highly expressed gene in the library, not contamination.
ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) is a technique used in molecular biology to assess genome-wide chromatin accessibility. ATAC-seq identifies accessible DNA regions by probing open chromatin with hyperactive mutant Tn5 Transposase that inserts sequencing adapters into open regions of the genome.
In the data directory is a file called atacseq.fastq.gz which contains sequencing data from an ATAC-seq experiment. Use FastQC and Cutadapt to quality check and clean the data if necessary. Pay close attention to the ‘Per base sequence content’ and ‘Overrepresented sequences’ modules. Can you clean the data so both of these modules pass? If not, think about why?
# Run FastQC on the raw data
conda activate fastqc
fastqc -o exercises/atacseq tutorial/data/atacseq.fastq.gz
# Trim the Illumina TruSeq adapter sequence
# Remove over represented sequences containing N base calls
conda activate cutadapt
cutadapt --max-n 1 -o exercises/atacseq/atacseq_trimmed.fastq.gz tutorial/data/atacseq.fastq.gz
# Run FastQC on the trimmed data
conda activate fastqc
fastqc -o exercises/atacseq exercises/atacseq/chipseq_trimmed.fastq.gz
# The 'Per base sequence content' module fails because of the strange pattern at the beginning of the read.
# This is not due to contamination or a probably with the library.
# The ATAC-seq protocol uses a Tn5 transposase which inserts itself into regions of open DNA.
# Those regions of open DNA are what go on to be sequenced.
# The strange pattern you observe is due to a bias in the Tn5 transposase itself - it has a tendency to insert itself into regions of the DNA with a particular sequence composition.
## Started analysis of atacseq.fastq.gz
## Approx 5% complete for atacseq.fastq.gz
## Approx 10% complete for atacseq.fastq.gz
## Approx 15% complete for atacseq.fastq.gz
## Approx 20% complete for atacseq.fastq.gz
## Approx 25% complete for atacseq.fastq.gz
## Approx 30% complete for atacseq.fastq.gz
## Approx 35% complete for atacseq.fastq.gz
## Approx 40% complete for atacseq.fastq.gz
## Approx 45% complete for atacseq.fastq.gz
## Approx 50% complete for atacseq.fastq.gz
## Approx 55% complete for atacseq.fastq.gz
## Approx 60% complete for atacseq.fastq.gz
## Approx 65% complete for atacseq.fastq.gz
## Approx 70% complete for atacseq.fastq.gz
## Approx 75% complete for atacseq.fastq.gz
## Approx 80% complete for atacseq.fastq.gz
## Approx 85% complete for atacseq.fastq.gz
## Approx 90% complete for atacseq.fastq.gz
## Approx 95% complete for atacseq.fastq.gz
## Approx 100% complete for atacseq.fastq.gz
## Analysis complete for atacseq.fastq.gz
## This is cutadapt 4.1 with Python 3.10.6
## Command line parameters: --max-n 1 -o exercises/atacseq/atacseq_trimmed.fastq.gz tutorial/data/atacseq.fastq.gz
## Processing single-end reads on 1 core ...
## Finished in 0.81 s (8 µs/read; 7.39 M reads/minute).
##
## === Summary ===
##
## Total reads processed: 100,000
##
## == Read fate breakdown ==
## Reads with too many N: 1,784 (1.8%)
## Reads written (passing filters): 98,216 (98.2%)
##
## Total basepairs processed: 7,117,244 bp
## Total written (filtered): 7,053,624 bp (99.1%)
## Skipping 'exercises/atacseq/chipseq_trimmed.fastq.gz' which didn't exist, or couldn't be read